RWKV v6: RWKV_WKV op CUDA implementation #9454
Conversation
force-pushed from 7dd075a to 8ec53bf
force-pushed from 8ec53bf to 19f2a61
Hi, Molly.
force-pushed from 19f2a61 to 7c39f2d
Yes. However, I don't think it's that urgent; it can also be done after RWKV v7 is released, in the initial RWKV v7 support PR.
Hi! @ggerganov
* ggml: CUDA unary op EXP
* ggml: rwkv_wkv op CUDA impl
Added the RWKV_WKV CUDA impl and a test_case in test-backend-ops.cpp.
Also added the unary op EXP for CUDA, so that the RWKV v6 graph gets split less when running on a GPU.
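For reference, an elementwise EXP kernel boils down to something like the following minimal sketch. This is illustrative only, not the PR's actual ggml-cuda code; the function name and launch parameters are assumptions:

```cuda
#include <cuda_runtime.h>

// Minimal sketch of an elementwise EXP kernel over a flat f32 buffer:
// one thread per element. Illustrative; not the PR's actual code.
__global__ void exp_f32(const float * x, float * dst, const int n) {
    const int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i >= n) {
        return;
    }
    dst[i] = expf(x[i]);
}

// Example launch, assuming a block size of 256:
// exp_f32<<<(n + 255) / 256, 256>>>(x, dst, n);
```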
The kernel is adapted from https://github.com/BlinkDL/ChatRWKV/blob/main/rwkv_pip_package/src/rwkv/cuda/rwkv6.cu, with added support for batched inference.
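To make the structure concrete, here is a minimal sketch of the recurrence such a WKV kernel computes. It assumes f32 tensors, a head size of 64, one block per (batch, head) pair, and one thread per value channel; all names and layouts are illustrative assumptions, and it omits any vectorization the real kernel may use:

```cuda
#include <cuda_runtime.h>

#define HEAD_SIZE 64  // assumption: RWKV v6 head size, also the block size

// Sketch of the per-(batch, head) WKV recurrence. For each token t and
// value channel i (this thread):
//   y[t][i]  = sum_j r[t][j] * (tf[j] * k[t][j] * v[t][i] + S[j][i])
//   S[j][i] <- S[j][i] * td[t][j] + k[t][j] * v[t][i]
__global__ void wkv6_sketch(const int n_tokens, const int n_heads,
                            const float * k, const float * v, const float * r,
                            const float * tf, const float * td,
                            const float * s_in, float * dst, float * s_out) {
    const int i     = threadIdx.x;           // value channel within the head
    const int head  = blockIdx.x % n_heads;
    const int batch = blockIdx.x / n_heads;
    const int C     = n_heads * HEAD_SIZE;   // channels per token

    // This thread owns one column of the per-head state matrix.
    float S[HEAD_SIZE];
    for (int j = 0; j < HEAD_SIZE; j++) {
        S[j] = s_in[(batch * n_heads + head) * HEAD_SIZE * HEAD_SIZE + j * HEAD_SIZE + i];
    }

    // Stage per-token vectors in shared memory so every thread can read
    // all channels of the head.
    __shared__ float sk[HEAD_SIZE], sr[HEAD_SIZE], std_[HEAD_SIZE], stf[HEAD_SIZE];
    stf[i] = tf[head * HEAD_SIZE + i];       // time_first is constant over tokens

    for (int t = 0; t < n_tokens; t++) {
        const int idx = (batch * n_tokens + t) * C + head * HEAD_SIZE + i;
        __syncthreads();
        sk[i]   = k[idx];
        sr[i]   = r[idx];
        std_[i] = td[idx];
        __syncthreads();

        const float vv = v[idx];
        float y = 0.0f;
        for (int j = 0; j < HEAD_SIZE; j++) {
            const float kv = sk[j] * vv;
            y   += sr[j] * (stf[j] * kv + S[j]);
            S[j] = S[j] * std_[j] + kv;
        }
        dst[idx] = y;
    }

    // Write the updated state back so the next chunk continues from it.
    for (int j = 0; j < HEAD_SIZE; j++) {
        s_out[(batch * n_heads + head) * HEAD_SIZE * HEAD_SIZE + j * HEAD_SIZE + i] = S[j];
    }
}

// Example launch for B sequences of n_tokens each:
// wkv6_sketch<<<B * n_heads, HEAD_SIZE>>>(n_tokens, n_heads,
//                                         k, v, r, tf, td, s_in, dst, s_out);
```

The key point is that the token loop is inherently sequential (the state carries across tokens), so parallelism comes from batches, heads, and channels; batched inference simply extends the grid to B * n_heads blocks.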
Going to add some speed and other test results tomorrow.

Prompt:
Here's the speed comparison between the original and the PR version.
The test was done on my weird 12900HK ES + RTX 4090 PC, which is relatively CPU-bound. All tests use FP16 with all layers offloaded to the GPU. Prompt length = 107, generation length = 1000.
Here's the perplexity comparison between the original and the PR version, tested on wikitext-2 using FP16 with all layers offloaded to the GPU.
test-backend-ops perf tests:
TODO: